Transportation spending is an important part of everyone's daily expenses, although trip modes and purposes differ among people. Whether you drive your own car or take the metro or a shuttle to school or work, special circumstances arise: no parking near the destination, a long walk on a rainy day, or an unsuitable metro/shuttle schedule. At those moments, people in NYC (where most residents rely on public transportation, and relatively few own, or need to own, a car) turn to taxi service so they can arrive on time. Compared with private cars and public transportation, the biggest advantage of this type of service is its combination of mobility and flexibility. For this project, we use a dataset of 2016 NYC taxi trips to answer the following questions. NYC has five boroughs: the Bronx, Brooklyn, Manhattan, Queens, and Staten Island.
1. How do different factors affect trip duration (weather, distance, and weekend or not)?
2. Are the peak hours distributed differently on weekdays and weekends?
3. Which areas have denser pick-up spots for taxi drivers, and why?
# Import the modules
import glob
import pandas as pd
import numpy as np
from math import sin, cos, sqrt, atan2, radians
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import plotly.graph_objects as go
import datetime as dt
# Check the current path
%pwd
# Change the path
%cd C:/Users/yang7/OneDrive/Desktop/python
# weather dataset import
weather = pd.read_csv (r'C:/Users/yang7/OneDrive/Desktop/python/weather_data_nyc_centralpark_2016.csv')
weather.drop(["maximum temerature", "minimum temperature","precipitation","snow fall","snow depth"], axis = 1, inplace = True)
weather.head()
# 2nd dataset of taxi drives in NYC from 2016 Jan to 2016 June
sample = pd.read_csv (r'C:/Users/yang7/OneDrive/Desktop/python/NYC Taxi Data Set.csv')
sample.head()
sample.shape
# Strip any leading spaces in the pickup_datetime column
sample.pickup_datetime = sample.pickup_datetime.str.lstrip()
# Extract a part of date as a new list for future use
# Convert all kinds of object in pickup time column to datetime with one uniform format
import datetime
taxiDate = []
for i in range(1048575):
    # Slice the date prefix of pickup_datetime (column 2) according to its format
    if sample.iat[i, 2][1] == '/' and sample.iat[i, 2][3] == '/':
        taxiDate.append(datetime.datetime.strptime(sample.iat[i, 2][0:8], '%m/%d/%Y').strftime('%d-%m-%y'))
    elif sample.iat[i, 2][2] == '/' and sample.iat[i, 2][4] == '/':
        taxiDate.append(datetime.datetime.strptime(sample.iat[i, 2][0:9], '%m/%d/%Y').strftime('%d-%m-%y'))
    elif sample.iat[i, 2][2] == '/' and sample.iat[i, 2][5] == '/':
        taxiDate.append(datetime.datetime.strptime(sample.iat[i, 2][0:10], '%m/%d/%Y').strftime('%d-%m-%y'))
    else:
        taxiDate.append(datetime.datetime.strptime(sample.iat[i, 2][0:10], '%d-%m-%Y').strftime('%d-%m-%y'))
# Generate three columns: pickup hour, date, and day of week
sample = sample.assign(pickup = ' ')
sample = sample.assign(date = ' ')
sample = sample.assign(day = ' ')
for i in range(1048575):
    sample.iat[i, -3] = sample.iat[i, 2][-5:-3]
    sample.iat[i, -2] = taxiDate[i]
    sample.iat[i, -1] = datetime.datetime.strptime(sample.iat[i, -2], '%d-%m-%y').strftime('%A')
sample.head()
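As a faster alternative to the character-by-character format checks above, pandas can parse each of the two date layouts in a vectorized pass. This is only a sketch on a toy frame; the sample values and the `pickup_datetime` column name mirror the real data but are assumptions:

```python
import pandas as pd

# Toy frame with the two formats seen in the raw data (hypothetical values)
df = pd.DataFrame({'pickup_datetime': ['1/2/2016 08:15', '03-01-2016 17:40']})

# Parse with the first format; rows that do not match become NaT
# (errors='coerce'), then a second pass with the other format fills them in.
parsed = pd.to_datetime(df['pickup_datetime'], format='%m/%d/%Y %H:%M', errors='coerce')
parsed = parsed.fillna(pd.to_datetime(df['pickup_datetime'], format='%d-%m-%Y %H:%M', errors='coerce'))

# The same three derived columns as the loop above, but without iterating rows
df['pickup'] = parsed.dt.strftime('%H')
df['date'] = parsed.dt.strftime('%d-%m-%y')
df['day'] = parsed.dt.strftime('%A')
```

This avoids the million-iteration Python loop; the two-pass coerce-and-fill pattern handles mixed formats without inspecting individual characters.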
# Join two tables
data=pd.merge(weather, sample, on='date', how='right')
# Create a variable for weekend/weekdays
data=data.assign(weekend = 'weekdays')
for i in range(1048575):
    if data.iat[i, -2] == 'Saturday' or data.iat[i, -2] == 'Sunday':
        data.iat[i, -1] = 'weekends'
# Define the distance function from latitudes and longitudes (haversine formula)
def distance(lat1, lon1, lat2, lon2):
    # Approximate radius of Earth in km
    R = 6373.0
    lat1 = radians(lat1)
    lon1 = radians(lon1)
    lat2 = radians(lat2)
    lon2 = radians(lon2)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return R * c
# Create a new column and convert the lons and lats to floats
data = data.assign(distance_km = 0.0)
data[["pickup_longitude", "pickup_latitude","dropoff_longitude","dropoff_latitude"]] = data[["pickup_longitude", "pickup_latitude","dropoff_longitude","dropoff_latitude"]].astype(float)
data.dtypes
# Transform to distance
for i in range(1048575):
    data.iat[i, -1] = distance(data.iat[i, 8], data.iat[i, 7], data.iat[i, 10], data.iat[i, 9])
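The per-row loop above is slow on roughly a million rows. The same haversine formula can run on whole columns at once with NumPy; a sketch with illustrative coordinates (Times Square to the Empire State Building), not values from the dataset:

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    # Same formula as distance() above, applied to whole arrays at once
    R = 6373.0  # approximate Earth radius in km
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return R * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# Illustrative pair of points (about 1.1 km apart)
df = pd.DataFrame({'pickup_latitude': [40.7580], 'pickup_longitude': [-73.9855],
                   'dropoff_latitude': [40.7484], 'dropoff_longitude': [-73.9857]})
df['distance_km'] = haversine_km(df['pickup_latitude'], df['pickup_longitude'],
                                 df['dropoff_latitude'], df['dropoff_longitude'])
```

One call computes the whole `distance_km` column, which is typically orders of magnitude faster than per-row `iat` assignment.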
# Drop the useless cols.
data.drop(["store_and_fwd_flag", "vendor_id","id","dropoff_datetime"], axis = 1, inplace = True)
data = data.dropna()
data.head()
# Import the 3rd dataset that has zipcode and locations
zipLocation = pd.read_csv('https://gist.githubusercontent.com/erichurst/7882666/raw/5bdc46db47d9515269ab12ed6fb2850377fd869e/US%20Zip%20Codes%20from%202013%20Government%20Data')
# Change the column names for the join process
zipLocation.columns = ['ZIPCODE', 'LAT','LONG']
# Import the 4th dataset that has NYC restaurants' inspection record
# We will only use the zipcode to locate and calculate the sum of restaurants in that area
restaurants = pd.read_csv (r'C:/Users/yang7/OneDrive/Desktop/python/New_York_City_Restaurant_Inspection_Results.csv')
restaurants.head()
# Delete the extra records of the same restaurants
restaurants = restaurants.drop_duplicates(subset=['CAMIS'], keep='first')
restaurants = restaurants[['CAMIS','DBA','ZIPCODE']].copy()
# Drop the rows with missing value
restaurants = restaurants.dropna()
# Calculate the num of restaurants by zip code
restaurantsZip=restaurants.groupby(by='ZIPCODE')['DBA'].describe()
# Prepare for join
restaurantsZip=restaurantsZip.reset_index()
# Join
restaurantsLocation=pd.merge(restaurantsZip, zipLocation, on='ZIPCODE', how='left')
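The effect of `how='left'` in the join above can be seen on a toy frame: every zip code with a restaurant count is kept, and zip codes missing from the location table get NaN coordinates (hypothetical values):

```python
import pandas as pd

# Toy restaurant counts and a location table missing one zip code
counts = pd.DataFrame({'ZIPCODE': [10001, 10002], 'count': [12, 7]})
coords = pd.DataFrame({'ZIPCODE': [10001], 'LAT': [40.75], 'LONG': [-73.99]})

# Left join keeps both rows of counts; 10002 gets NaN coordinates
joined = pd.merge(counts, coords, on='ZIPCODE', how='left')
```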
# Filter out zip codes with too few restaurants
restaurantsLocation=restaurantsLocation[restaurantsLocation['count']>5]
# Convert the count column data type.
restaurantsLocation['count'] = restaurantsLocation[['count']].astype(float)
# Make a subset copy to use in this question
regression=data[['average temperature', 'distance_km', 'weekend','trip_duration','passenger_count']].copy()
regression.head()
# Determine the independent variables
cols = ['average temperature', 'distance_km', 'weekend','passenger_count']
cat_cols = ['weekend']
# Set up variables for multiple linear regression model
X = pd.get_dummies(regression[cols], columns=cat_cols, prefix='', prefix_sep='', drop_first=True)
X = sm.add_constant(X)
y = np.log(regression['trip_duration'])
pd.concat([X, y], axis=1).head()
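The `drop_first=True` option above deserves a note: it drops one dummy level so the remaining indicator is interpreted against a baseline category. A minimal sketch on a toy frame:

```python
import pandas as pd

# Toy categorical column with the same two levels as our 'weekend' variable
toy = pd.DataFrame({'weekend': ['weekdays', 'weekends', 'weekdays']})

# drop_first=True drops the alphabetically first level ('weekdays'),
# which becomes the regression baseline; only 'weekends' remains
dummies = pd.get_dummies(toy, columns=['weekend'], prefix='', prefix_sep='',
                         drop_first=True)
```

Dropping one level avoids perfect collinearity with the constant term added by `sm.add_constant`.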
# Fit linear regression model and report results
model = sm.OLS(endog=y, exog=X)
results = model.fit()
results.summary()
# Plot the residual information
plt.figure(figsize=(8,8))
sns.set(style="whitegrid")
ax = sns.scatterplot(x=results.fittedvalues,y=results.resid, marker='o', data=regression)
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
As shown above, we tried a log-level regression model. First, the R-squared is quite small at 0.32, which means that only 32% of the variation in trip duration is explained by our model. Second, although the p-values of all variables are extremely close to zero, which means the coefficients are significant, the coefficients for passenger count and weather are quite small. Third, the residual plot shows many outliers in the dataset. It should show a random scatter of points around 0; instead, we can see more than 10 points with extreme error values. Therefore, we need to adjust the variables considered, remove the outliers, and rerun the model.
# Make a copy
noOutlier = data.copy()
# Define a function to remove outliers by the IQR rule
def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out
# Apply to the variables
noOutlier=remove_outlier(noOutlier,'average temperature')
noOutlier=remove_outlier(noOutlier,'passenger_count')
noOutlier=remove_outlier(noOutlier,'trip_duration')
noOutlier=remove_outlier(noOutlier,'distance_km')
noOutlier.shape
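The IQR fences used by `remove_outlier` can be checked on a toy column (hypothetical values): a point is dropped when it falls outside [q1 - 1.5*IQR, q3 + 1.5*IQR].

```python
import pandas as pd

# Toy column with one obvious outlier
toy = pd.DataFrame({'x': [1, 2, 3, 4, 5, 100]})
q1, q3 = toy['x'].quantile(0.25), toy['x'].quantile(0.75)
iqr = q3 - q1
# Same fences as remove_outlier: only the value 100 falls outside
kept = toy[(toy['x'] > q1 - 1.5 * iqr) & (toy['x'] < q3 + 1.5 * iqr)]
```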
cols = ['distance_km', 'weekend','passenger_count']
cat_cols = ['weekend']
# Set up variables for multiple linear regression model
X = pd.get_dummies(noOutlier[cols], columns=cat_cols, prefix='', prefix_sep='', drop_first=True)
X = sm.add_constant(X)
y = np.log(noOutlier['trip_duration'])
# Fit linear regression model and report results
model = sm.OLS(endog=y, exog=X)
results2 = model.fit()
results2.summary()
# Plot the residual
plt.figure(figsize=(8,8))
sns.set(style="whitegrid")
ax = sns.scatterplot(x=results2.fittedvalues,y=results2.resid, marker='o', data=noOutlier)
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
results2.params
#Confidence intervals ([0.025, 0.975])
results2.conf_int()
# Calculate the error bar lengths for the confidence intervals
err_series = results2.params - results2.conf_int()[0]
err_series
The errors are very small, which means the regression coefficients are measured precisely and their impact on the dependent variable is distinguishable from 0.
coef_df = pd.DataFrame({'coef': results2.params.values[1:],
'err': err_series.values[1:],
'varname': err_series.index.values[1:]
})
coef_df
# Plot the coefficients with their error bars
fig, ax = plt.subplots(figsize=(8, 8))
sns.set(style="white")
coef_df.plot(x='varname', y='coef', kind='bar',
ax=ax, color='none',
yerr='err', legend=False)
ax = sns.scatterplot(x=np.arange(coef_df.shape[0]), y=coef_df['coef'], marker='o', data=coef_df)
ax.set_ylabel('')
ax.set_xlabel('')
ax.axhline(y=0, linestyle='--', color='b', linewidth=2)
ax.xaxis.set_ticks_position('none')
ax.set_xticklabels(['distance_km', 'passenger_count', 'weekends'],
rotation=0, fontsize=16)
After rerunning the same model, we get a much better residual plot, roughly within the range of -6 to 2. As the expected trip duration increases, the predictions become more accurate. When the expected trip duration is under 6, there is a bigger chance that the actual duration is shorter.
Looking at the statistical results, all p-values confirm that the coefficients in the model are significantly different from 0.
When distance increases by 1 km, the trip duration increases by 32.89%.
When the number of passengers increases by 1, the trip duration increases by 3.12%.
On weekends, with other variables held constant, the trip duration decreases by 13.38% compared to weekdays.
The coefficient of average temperature is still very small, so we removed it from the model.
Although the R-squared is only 40.7%, the independent variables do have an impact on trip duration, especially the straight-line distance between pick-up and drop-off spots and whether it is a weekend. We would need information on more potential factors to improve the model.
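The percentage readings above follow from the log-level specification: a coefficient b implies a (e^b - 1) x 100% change in trip duration per unit increase of the regressor. A quick sketch of the conversion with a hypothetical coefficient (not the fitted value):

```python
import numpy as np

# Hypothetical log-level coefficient; (exp(b) - 1) * 100 gives the
# percentage change in the (un-logged) response per unit of the regressor
b = 0.2845
pct_change = (np.exp(b) - 1) * 100
```

For small b, pct_change is close to 100*b, but the exact exp transform matters for larger coefficients like the distance effect.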
noOutlier.shape
noOutlier.head()
# Convert the pickup column into a datetime
for i in range(798405):
    noOutlier.iat[i, -4] = dt.datetime.strptime(noOutlier.iat[i, -4], '%H')
# Sort by the pickup hour
peak=noOutlier.sort_values('pickup')
# Make a dataframe of each day (Monday/Tuesday/...) and hour with their passenger-count statistics
peak.pickup=peak.pickup.dt.strftime("%H:00")
allPeak=peak.groupby(by=['pickup','day'])['passenger_count'].describe()
allPeak=allPeak.reset_index()
allPeak=allPeak.assign(weekend = 'weekdays')
for i in range(168):
    if allPeak.iat[i, 1] == 'Saturday' or allPeak.iat[i, 1] == 'Sunday':
        allPeak.iat[i, -1] = 'weekends'
plt.figure(figsize = (16,9))
sns.set(style="darkgrid")
palette = sns.color_palette("mako_r", 7)
# Plot the overall peak hours changes in a graph
# Use different ways to show weekdays or weekends
# Use different color to show different days
overall=sns.lineplot(x='pickup', y="count",
hue='day',style='weekend',
palette = palette,markers = ["o", "s"],
data=allPeak)
for item in overall.get_xticklabels():
    item.set_rotation(60)
plt.title("Peak Hours from Mondays to Sundays", fontsize = 15)
plt.xlabel("Pick-up Time", fontsize = 15)
plt.ylabel("The total of passengers", fontsize = 15)
plt.show()
From the plots, we can tell that from Monday to Thursday the numbers of passengers follow the same pattern.
On Friday nights around 11 pm, passenger numbers remain at the same level instead of decreasing as on other weekdays.
The demand for taxis at midnight starts rising on Thursday night and then grows much larger on weekends.
For every day, the lowest demand is around 4 am and the highest is around 7 pm. However, on weekdays the second-biggest peak is around 8 am, a reasonable time for going to work. On weekends there are two other small peaks: one around 1 am, when people are out at night, and another around 1 pm, probably reflecting lunch trips.
Saturdays and Sundays have very similar trends overall except at night; both have higher demand from 12 am to 4 am than weekdays.
# Your mapbox token
mapbox_access_token = 'pk.eyJ1Ijoid2FueXVuLXlhbmciLCJhIjoiY2syb3E4cTU5MTZhbDNtbzNyejRxZDAzbSJ9.V9aZq1zuZ7bovxHrjfce6g'
# Add a dummy variable to the DataFrame called "sample" that tags every 20th row to be included in the sample
# 5% of total data size
mapSample=noOutlier.assign(sample = 0)
mapSample.shape
# Assign random numbers
np.random.seed( 30 )
mapSample['random number'] = np.random.randint(0,1000,size=(len(mapSample),1))
mapSample=mapSample.sort_values('random number')
# Random sampling
for i in range(0, len(mapSample)):
    if i % 20 == 0:
        mapSample.iat[i, -2] = 1
mapSample
mapSample=mapSample[mapSample['sample']==1]
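The shuffle-and-modulo draw above can also be done with pandas' built-in sampler in one call; a sketch on a toy frame (the fraction and seed are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame of 100 rows; frac=0.05 draws a reproducible 5% random subset
df = pd.DataFrame({'x': np.arange(100)})
subset = df.sample(frac=0.05, random_state=30)
```

`DataFrame.sample` handles the shuffling internally, so no helper columns need to be added and dropped.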
mapSample['pickup_longitude'].describe()
# the mean longitude of the sample is -73.979750, which will be used as the zoom center later.
mapSample['pickup_latitude'].describe()
# the mean latitude of the sample is 40.753085, which will be used as zoom center later.
mapSample.shape
taxi_map_data = go.Scattermapbox(
lon = mapSample ['pickup_longitude'],
lat = mapSample ['pickup_latitude'],
mode = 'markers',
marker = dict(
color = 'lightskyblue',
symbol = 'circle',
opacity = .5
),
name = "Taxi pickup locations"
)
taxi_map_data2 = go.Scattermapbox(
lon = restaurantsLocation['LONG'],
lat = restaurantsLocation['LAT'],
mode = 'markers',
text = restaurantsLocation['count'],
hoverinfo='text',
marker = dict(
color = 'pink',
size = restaurantsLocation['count']/30,
symbol = 'circle',
opacity = .9,
),
name = "Restaurant density by zipcode "
)
taxi_map_layout = go.Layout(
title = 'Taxi Pickup Locations & Restaurants Density in NYC (Size: The number of restaurants)',
mapbox=go.layout.Mapbox(
accesstoken=mapbox_access_token,
zoom=1
)
)
taxi_map = go.Figure(data=[taxi_map_data,taxi_map_data2], layout=taxi_map_layout)
taxi_map.update_layout(
hovermode='closest',
mapbox=go.layout.Mapbox(
accesstoken=mapbox_access_token,
bearing=0,
center=go.layout.mapbox.Center(
lat=40.753085,
lon= -73.979750
),
pitch=0,
zoom=10
)
)
taxi_map.show()
For the pickup locations, we used random sampling to reduce the dataset to 5%, since the full sample was too big for the map-plot feature here. After sampling, the sample size became 79,841. The pickup points in the Manhattan area are the densest. The Brooklyn area also has many spots along the river, and some points cluster around LGA airport.
From the images, we can clearly tell that the southern part of Manhattan also has more restaurants, which is a likely factor shaping the pattern of taxi pick-up locations. Pick-up locations outside Manhattan also tend to fall within areas that have more restaurants.
# Zoom in Manhattan to look at the pattern closer
# The center is Central Park in NYC
taxi_map2 = go.Figure(data=[taxi_map_data,taxi_map_data2], layout=taxi_map_layout)
taxi_map2.update_layout(
hovermode='closest',
mapbox=go.layout.Mapbox(
accesstoken=mapbox_access_token,
bearing=0,
center=go.layout.mapbox.Center(
lat=40.778424,
lon=-73.96175
),
pitch=0,
zoom=12
)
)
taxi_map2.show()
Within Manhattan, the southern part is more packed than the northern part in general. The most popular pick-up spots are located in the several blocks just south of Central Park.